Applied Computational Genomics

4/4/2024

An introduction to GLMs

Linear models are powerful, but limited


In a simple linear model, we model a response \(Y\) as a function of some inputs \(X\) (our independent variables), along with an error term \(\epsilon\).

\[Y = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_p + \epsilon\]

\(Y\) : our response variable

\(\beta_0\) : the intercept term

\(\beta_p\) : the coefficient associated with independent variable \(X_p\)

\(\epsilon\) : error (residual) term
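Fitting a model like this in R is a one-liner with lm(). A quick sketch using simulated data (all numbers below are made up):

```r
set.seed(42)

# simulate data from a known linear model: y = 3 + 2x + error
x <- 1:50
y <- 3 + 2 * x + rnorm(50, mean = 0, sd = 5)

# fit the model and inspect the estimated coefficients
m <- lm(y ~ x)
coef(m)  # estimates should be close to 3 (intercept) and 2 (slope)
```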

We make a few assumptions when we fit linear models

Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?

Independence: are the errors (residuals) independent of each other?

Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?

Normality: are the errors (residuals) normally distributed?

Checking the independence assumption

library(tidyquant) 
library(ggplot2)
library(cowplot)

googl <- tq_get(x = "GOOGL", from = '2020-01-01')

ggplot(googl, aes(x=date, y=open)) +
    geom_line() +
    labs(x="Date", y="GOOGL opening stock price ($)") +
    theme_cowplot()

Checking the homoscedasticity assumption

library(ggplot2)
library(cowplot)

set.seed(12345)

# number of observations
n <- 100 
# generate x values from 1 to n
x <- 1:n
# generate heteroscedastic errors: the error SD grows with x,
# so the spread of the points increases along the x-axis
sd <- runif(n, min = 0, max = 4)
error <- rnorm(n, 0, sd * x)
# generate y values
y <- (x * 10) + error 

# put x and y into dataframe
df = data.frame(x = x, y = y)

# plot
ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method="lm", se=FALSE) + 
    theme_cowplot() + 
    labs(x = "X", y = "Y")

Checking the homoscedasticity assumption

# make predictions using simple linear model
m = lm(y ~ x, data=df)
df$pred = predict(m)

# plot residuals
ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method="lm", se=FALSE) + 
    geom_segment(aes(xend = x, yend = pred), color="firebrick") +
    theme_cowplot() + 
    labs(x = "X", y = "Y")

Checking the normality assumption

library(HistData)
data(GaltonFamilies)

# subset to male children
galton_filt = subset(GaltonFamilies, gender=="male")

ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
    geom_point(alpha=0.5, size=5) +
    geom_smooth(method="lm", se=FALSE) +
    theme_cowplot() +
    labs(x="Mid-parent height", y="Child height")

Checking the normality assumption

m = lm(childHeight ~ midparentHeight, data=galton_filt)
galton_filt$pred = predict(m)

ggplot(galton_filt, aes(x=midparentHeight, y=childHeight)) +
    geom_segment(aes(xend = midparentHeight, yend = pred), color="firebrick") +
    geom_point(alpha=0.5, size=5) +
    geom_smooth(method="lm", se=FALSE) +
    theme_cowplot() +
    labs(x="Mid-parent height", y="Child height")

Checking the normality assumption

hist(resid(m), breaks=50, main=NULL)
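A Q-Q plot is another common normality check: points falling along the diagonal line are consistent with normally distributed residuals. A sketch, refitting the same parent/child height model:

```r
library(HistData)
data(GaltonFamilies)

# refit the parent/child height model from above
galton_filt <- subset(GaltonFamilies, gender == "male")
m <- lm(childHeight ~ midparentHeight, data = galton_filt)

# Q-Q plot of the residuals against theoretical normal quantiles
qqnorm(resid(m))
qqline(resid(m))
```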

We make a few assumptions when we fit linear models

Linearity: can the relationship between \(X\) and \(Y\) be approximated by a straight line?

Independence: are the \(Y\) observations independent of each other?

Homoscedasticity: is the variance of the errors (residuals) constant as a function of \(X\)?

Normality: are the errors (residuals) normally distributed?


What if these assumptions are violated?

Modeling other kinds of data

In biology, we’re often interested in modeling data that don’t conform to the assumptions of simple linear regression.

Count data

Number of DNA sequencing reads aligned to a particular genomic position

Binary data

Survival after treatment with a new therapeutic compound

Discrete categorical data

Cell type identity in a large single-cell RNA-seq experiment

To model these data types, we can turn to generalized linear models.

The basic ingredients of a GLM

  1. An expected distribution for the \(Y\) values

\[\textcolor{teal}{\mu} = \mathbb{E}(Y | X)\]

  2. A linear predictor

\[\textcolor{violet}{\eta} = \beta_0 + \beta_{1}X_1 + ... + \beta_{p}X_{p}\]

  3. A “link” function

\[\textcolor{olive}{g}(\textcolor{teal}{\mu}) = \textcolor{violet}{\eta}\] \[\textcolor{teal}{\mu} = \textcolor{olive}{g}^{-1}(\textcolor{violet}{\eta})\]
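To make these ingredients concrete, here is a small sketch with made-up coefficients, using the inverse logit as the inverse link \(g^{-1}\):

```r
# hypothetical coefficients for a single-predictor model
beta0 <- -2
beta1 <- 0.5
x <- c(1, 4, 8)

# ingredient 2: the linear predictor eta (can be any real number)
eta <- beta0 + beta1 * x

# ingredient 3: the inverse link maps eta onto the scale of mu;
# the inverse logit squeezes eta into (0, 1), suitable for a probability
mu <- exp(eta) / (1 + exp(eta))
mu
```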

Why do we need GLMs?

titanic = read.csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

ggplot(titanic, aes(x=Fare, y=Survived)) +
    geom_point(alpha=0.1, size=5) +
    theme_cowplot() 

Whether a passenger on the Titanic survived given the amount they paid for their fare.

Why do we need GLMs?

ggplot(titanic, aes(x=Fare, y=Survived)) +
    geom_point(alpha=0.1, size=5) +
    geom_smooth(method="lm") +
    theme_cowplot()

Why do we need GLMs?

ggplot(titanic, aes(x=Fare, y=Survived)) +
    geom_point(alpha=0.1, size=5) +
    geom_smooth(method="glm", method.args = list(family = binomial(link="logit"))) +
    theme_cowplot()

The “canonical” link \(g(\mu)\) for a logistic regression is \(\log\left(\frac{\mu}{1 - \mu}\right)\).

If we define our linear predictor as \(\eta\), we can apply the inverse link \(g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}\) to ensure that our predictions lie on the scale of the response (probabilities between 0 and 1).
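We can fit this logistic model directly with glm() and confirm the relationship between the link-scale and response-scale predictions. A sketch, using the same Titanic data loaded above:

```r
titanic <- read.csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# logistic regression of survival on fare
m <- glm(Survived ~ Fare, data = titanic, family = binomial(link = "logit"))

# eta: predictions on the link (log-odds) scale
eta <- predict(m, type = "link")
# mu: predictions on the response (probability) scale
mu <- predict(m, type = "response")

# mu is the inverse logit of eta
all.equal(mu, exp(eta) / (1 + exp(eta)))
```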

Basic ingredients of a GLM in practice

We’ve established that we need three things in order to fit a GLM:

  • An expected distribution for the \(Y\) values (\(\mu\))

  • A linear predictor (\(\eta\))

  • A “link” between our linear predictor and the response variable (\(g\))

When we fit a GLM in R, what does this involve?

GLM case study: de novo mutations in human genomes

New mutations occur in germline (sperm or egg) cells and can be passed on to children.

In humans, we typically identify de novo mutations by sequencing trios (two parents and their child).

Figure 1

We expect to see around 70 DNMs per child

library(dplyr)

# read in DNM data
dnms = read.csv("https://raw.githubusercontent.com/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.txt", sep="\t")

# remove unphased DNMs
dnms = subset(dnms, phase != "na")

# group DNMs by sample id and phase
by_phase = dnms %>% count(new_sample_id, phase)
# group by sample alone to get totals
by_sample = dnms %>% count(new_sample_id)
by_sample$phase = "total"
merged = rbind(by_phase, by_sample)

ggplot(merged, aes(x=phase, y=n, col=phase)) +
    geom_boxplot(fill="white") +
    geom_jitter(size=2, width=0.15, alpha=0.5) +
    theme_cowplot() +
    labs(x="Parent of origin", y="Count")

De novo mutation counts in Utah families. Data from Sasani et al. (2019) eLife.

Modeling the dependence between DNM counts and parental age

We also expect de novo mutation counts to depend on parental age.

Each time a spermatogonial stem cell undergoes mitotic DNA replication, there’s an opportunity for mutations to occur.

DNA damage also accumulates in germ cells over time, leading to new mutations.


How many additional germline mutations do we expect to see with each year of paternal age?

Let’s build a GLM to find out.

De novo mutation counts increase with age

library(ggplot2)
library(dplyr)
library(cowplot)

dnms = read.csv("https://raw.githubusercontent.com/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.summary.csv")

ggplot(dnms, aes(x=dad_age, y=snv_dnms)) +
    geom_point(size=5) +
    labs(x="Paternal age", y="Number of DNMs in child") +
    theme_cowplot()

De novo mutation counts observed in children as a function of paternal age at birth. Data from Sasani et al. (2019) eLife.

The Poisson distribution is useful for modeling count data

Remember the first ingredient we need to fit a GLM:

An expected distribution for the \(Y\) values (\(\mu\))

Our model needs to capture the mean and variance of our \(Y\) values appropriately. Since we’re dealing with count data, the Poisson distribution is a natural choice.

The Poisson distribution tells us:

\[P(Y = k) = \frac{e^{-\lambda}\lambda^k}{k!} \ \text{for} \ k = 0,1,2...\]

where \(\lambda\) is the expected number of “events” (the mean we want to model in our GLM).
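For example, if we expect \(\lambda = 70\) DNMs per child, the formula above gives the probability of observing exactly \(k\) mutations, and it matches R’s built-in dpois():

```r
lambda <- 70
k <- 0:150

# Poisson probability mass function, written out by hand
p <- exp(-lambda) * lambda^k / factorial(k)

# agrees with R's built-in density function
all.equal(p, dpois(k, lambda))

# probability of seeing exactly 70 DNMs, and of seeing between 60 and 80
p[k == 70]
sum(p[k >= 60 & k <= 80])
```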

Modeling count data with the Poisson

Now we need the remaining ingredients for a GLM:

A “link” \(g\) between our linear predictor \(\eta\) and the conditional mean \(\mu\) of the response variable.

In Poisson regression, the canonical link function \(g(\mu)\) is the “log link”:

\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]

The log link makes sense for count data because it ensures that our predictions are always greater than 0.

We can fit a simple Poisson model using R


m = glm(snv_dnms ~ dad_age, family=poisson(link = "log"), data=dnms)

Since the log link is the default for Poisson GLMs, this is the same as:

m = glm(snv_dnms ~ dad_age, family="poisson", data=dnms)

Interpreting our Poisson GLM in R

Let’s look at the model summary:

m = glm(snv_dnms ~ dad_age, family="poisson", data=dnms)

summary(m)

Call:
glm(formula = snv_dnms ~ dad_age, family = "poisson", data = dnms)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.26034  -0.88413   0.02955   0.72784   3.04281  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 3.355454   0.083697   40.09   <2e-16 ***
dad_age     0.025808   0.002751    9.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 178.226  on 69  degrees of freedom
Residual deviance:  92.624  on 68  degrees of freedom
AIC: 512.22

Number of Fisher Scoring iterations: 4

Interpreting our Poisson GLM in R

One key difference from a simple linear model is that the coefficients are reported on the scale of the link function, not the scale of the response.

Recall that our Poisson GLM is:

\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}\]

This is the same as saying:

\[\mu = e^{\beta_0 + \beta_{1}X_{1} + ... + \beta_{p}X_{p}}\]

(applying the inverse of our link function to the linear predictor)
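predict() makes this relationship easy to see: type = "link" returns \(\eta = \log(\mu)\), while type = "response" returns \(\mu\) itself. A sketch, refitting the model above:

```r
dnms <- read.csv("https://raw.githubusercontent.com/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.summary.csv")
m <- glm(snv_dnms ~ dad_age, family = "poisson", data = dnms)

# eta: predictions on the link scale, i.e. log(mu)
eta <- predict(m, type = "link")
# mu: predictions on the response scale (expected DNM counts)
mu <- predict(m, type = "response")

# exponentiating the linear predictor recovers the expected counts
all.equal(mu, exp(eta))
```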

Interpreting our Poisson GLM in R

In our summary() output, let’s look at the coefficients:


coef(summary(m))
              Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) 3.35545435 0.083696926 40.090532 0.000000e+00
dad_age     0.02580802 0.002751444  9.379809 6.609266e-21


For each one-unit increase in dad_age, there is a multiplicative effect of \(e^{0.02581} \approx 1.0261\) on the response variable (the mean # of DNMs).

In other words, a child born to a 25-year-old father is expected to have \(1.0261\) times as many DNMs as a child born to a 24-year-old father.
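Exponentiating the coefficients (and their confidence intervals) puts everything on this multiplicative scale directly. A sketch, refitting the model above (using a Wald interval via confint.default):

```r
dnms <- read.csv("https://raw.githubusercontent.com/quinlan-lab/ceph-dnm-manuscript/master/data/second_gen.dnms.summary.csv")
m <- glm(snv_dnms ~ dad_age, family = "poisson", data = dnms)

# multiplicative effect per year of paternal age (~1.026)
exp(coef(m)["dad_age"])

# 95% Wald confidence interval, on the multiplicative scale
exp(confint.default(m, "dad_age"))

# implied effect of a 10-year increase in paternal age
exp(10 * coef(m)["dad_age"])
```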

Visualizing our Poisson model fit

ggplot(dnms, aes(x=dad_age, y=snv_dnms)) +
    geom_point(size=5) +
    geom_smooth(method="glm", method.args=list(family="poisson"), fullrange=TRUE, se=TRUE) +
    labs(x="Paternal age", y="Number of DNMs") +
    theme_cowplot()

Explicitly modeling counts as rates

We only looked for mutations at genomic positions where at least 10 sequencing reads were aligned in mom, dad, and the child.

library(cowplot)

dnms$callable_bp_bill = dnms$autosomal_callable_fraction / 1e9

ggplot(dnms, aes(x=factor(sample_id), y=callable_bp_bill)) +
    geom_col() +
    labs(x = "Sample ID", y="Callable bp (billions)") +
    coord_cartesian(ylim = c(2.45, 2.6)) +
    theme_cowplot() +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), text = element_text(size = 24), axis.text.y = element_text(size=24))

We were able to look for DNMs at ~2.6 billion positions in most children, but closer to ~2.4 billion in others.

Does this variable sequencing depth affect our observed DNM counts?

Explicitly modeling counts as rates

Our current model is:

\[\log(\mu) = \beta_0 + \beta_{1}X_{1}\]

where \(\mu\) is the expected number of DNMs observed in a child given \(X_1\), the age of the child’s father.

We actually want to model: \[\log\left(\frac{\mu}{N}\right) = \beta_0 + \beta_{1}X_{1}\]

where \(N\) is the number of genomic positions at which we had enough sequencing data in the trio to detect a mutation.

In other words, we want to model the expected number of DNMs in a child per “callable” base pair as a function of age.

Explicitly modeling counts as rates

Thanks to logarithm rules (remember those?),

\[\log\left(\frac{\mu}{N}\right) = \beta_0 + \beta_{1}X_{1}\]

\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]

In the context of Poisson GLMs, the \(\log(N)\) term is sometimes referred to as “exposure” or “offset.”
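We can check this algebra on simulated data (every parameter value below is invented): if counts are generated from a known per-base-pair rate with variable exposure \(N\), a Poisson GLM with offset(log(N)) recovers the rate parameters.

```r
set.seed(12345)
n <- 500

# invented "truth": log(rate per bp) = -18 + 0.025 * paternal age
age <- runif(n, 20, 45)
N <- runif(n, 2.4e9, 2.6e9)            # variable callable genome size
lambda <- exp(-18 + 0.025 * age) * N   # expected count = rate * exposure

# simulate Poisson counts
y <- rpois(n, lambda)

# fit with the exposure as an offset
m <- glm(y ~ age + offset(log(N)), family = "poisson")
coef(m)  # estimates should be close to -18 and 0.025
```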

Modeling with offsets in R

In R, we can easily update our linear model to account for an “offset.”

Recall that our new model is

\[\log(\mu) = \beta_0 + \beta_{1}X_{1} + \log(N)\]

where \(N\) is the number of callable base pairs per trio.

In R, this looks like:

m = glm(
    all_dnms ~ dad_age + offset(log(callable_bp)), 
    family="poisson", 
    data=dnms
)

De novo mutation rates are robust to coverage

m_with_offset = glm(
    all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)), 
    family="poisson", 
    data=dnms
)

summary(m_with_offset)

Call:
glm(formula = all_dnms ~ dad_age + offset(log(autosomal_callable_fraction)), 
    family = "poisson", data = dnms)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2407  -0.7998   0.0082   0.7726   2.9893  

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -18.193428   0.080329 -226.487   <2e-16 ***
dad_age       0.024550   0.002644    9.287   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 179.845  on 69  degrees of freedom
Residual deviance:  95.811  on 68  degrees of freedom
AIC: 521.3

Number of Fisher Scoring iterations: 4

The dad_age term is still significant, and the coefficient is very similar to its estimate in our “vanilla” Poisson model.

Take-home big picture

When fitting GLMs, you need to answer:

What distribution are my \(Y\) values expected to follow?

This will determine the family you choose in the glm function

How do I make the expected values of \(Y\) compatible with my linear predictor?

This will determine the link you choose

Be careful to interpret coefficients accordingly!

Examples of R methods for various types of data

# simple linear regression, same as lm(y ~ x)
m = glm(y ~ x, family = gaussian(link = "identity"))
# logistic regression
m = glm(y ~ x, family = binomial(link = "logit"))
# poisson regression
m = glm(y ~ x, family = poisson(link = "log"))